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Abstract 


In this paper I describe my experience porting a code called FLO67 from Fortran-77 
to CM-Fortran for the Connection Machine. The Fortran-77 code was written by Antony 
Jameson and was optimized for the Convex. The code solves the Euler equations for flow 
past a swept wing by a multigrid, multistage time stepping scheme with residual averaging. 
The Connection Machine code is a combination of CM-Fortran and Fortran-77. Currently, 
the code runs 15 times faster on a 64 K CM-2 compared to a two head Convex. In this 
document, I have attempted to extract some general features of the porting process that 
might be valuable to other people porting codes to CM-Fortran. Your comments are 
welcome. They should be directed to gyan@think.com. 


© 1992 
Permission is granted to copy 
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How To Begin? 


In what follows I will presume that you know CM-Fortran - or at least that you have 
a CM-Fortran manual and you are comfortable with looking things up in it. Now imagine 
that someone hands you 5,000-10,000 lines of Fortran-77 code to port to CM-Fortran. 
How do you begin? I will describe my experience of doing this for a code called FL067 [1]. 
My emphasis will be on conveying what I learned about the porting process and not on 
Computational Aerodynamics [2]. 

The original FLO67 code was written in Fortran-77 by Professor Antony Jameson of 
the Department of Aerospace Engineering at Princeton University. That code was opti- 
mized for the Convex. The code solves the Euler equations (basically a set of conservation 
laws for energy and momentum) for air flow past a swept wing. The method used is a 
multigrid, multistage time stepping scheme with residual averaging. The grid is three di- 
mensional, irregular (meaning that the bond lengths are not the same everywhere) and 
fixed in time (non-adaptive). In simpler terms, one has a three dimensional spatial volume 
with a wing in it. This volume is divided up into cells which are deformed cubes. Inflow 
and outflow boundary conditions are specified. The Euler equations are solved in integral 
form by balancing energy and momentum fluxes in and out of the faces of the cubes. 

The code works on several grids in a multigrid scheme. As one goes from the coarsest to 
the finest grid, the number of points in the grid increases by a factor of eight for successive 
grids. Typically the number of grid levels is four to five. There is a well defined procedure 
to propogate information from one multigrid level to the next. The multigrid scheme is 
very important in bringing the system to a steady state quickly. This is because the flow 
of information between grid levels helps to equilibrate long wavelength modes at the same 
speed as the short wavelength modes. The result is that large scale structure in the steady 
state solution come to equilibrium faster with the multigrid scheme. Roughly, for flow near 
Mach 1, about 40 multigrid steps are equivalent to 2000 steps without multigrid. 

I decided to port the whole code - so that it reads the same input DATA file and it 
generates the output file in the same format as the original Fortran-77 version. All the 
special features built into the original Fortran-77 version were kept intact. 

The first step in the porting process is to find a problem size that allows one to run 
the code entirely on the front end. The code is then run with a profiler option to determine 
where all the compute time is spent. This is done by using the -pg option of the Fortran-77 
compiler and then using a utility called gprof. The necessary commands to do this are 
listed below. I assume that the Fortran-77 source code is called flo67.f and that it reads a 
data file called DATA at run time. 

{77 -pg flo67.f -o flo67x 
flo67x < DATA > output 
gprof flo67x > prof 

When step 2 is completed, there will be a file called gmon.out in the local directory 
which contains information used by the gprof utility. The file ‘output’ should be carefully 
preserved as it will be used later in the porting process. At the end of step 3, the file ‘prof’ 
contains information about timing and flow control in the program. In particular, it is 
possible to extract from ‘prof’ the percentage of time spent in each routine as well as the 
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‘code tree’. These are shown in Table 1 and Figure 1 respectively for FLO67. 

Note that the percentages add up to a number bigger than 100. This is because the 
time I have used in making Table 1 is the TOTAL time spent in a routine, including 
time spent in routines called from it. In Figure 1 the arrows represent a parent/child 
relationship. An arrow from routine A to B means that routine B is called by routine A. 
It is clear from Table 1 and Figure 1 that almost all the time is spent in FLO and routines 
called by FLO. I ended up porting all the routines that are listed in uppercase in Figure 1 
to CM-Fortran. The remaining routines remained as they were in the original Fortran-77. 

There are a set of routines in Figure 1 that are special. These are the so called ‘leaf’ 
routines. A leaf routine is one that is called by one or more routines but does not itself 
call any routine. They are identified in Figure 1 as routines with only inward directed 
arrows. ‘The reason that these routines are special is that they can be ported one at a 
time, independent of the other routines. 


The Porting Strategy 


The porting process begins with moving the leaf routines to CM-Fortran. To illustrate, 
consider the leaf routine INIT. The initial Fortran-77 source-code for this is shown in 
Table 2. I call this INIT-START. In INIT-START, W(1,J,K,L) is the flow field. I,J,K label 
lattice points on a cubic grid. L is an attribute label. L=1 gives the density p, L=2,3,4 
gives the momentum density (pu, pv, pw) and L=d is the energy density ; p(u* +07 +w?). 
Note that the array dimensions are (0:IE,0:JE,0:KE). The actual space over which the flow 
field is defined however, is a smaller domain than this. It extends from (1:NX,1:NY,1:NZ) 
with NX=IE-2, NY=JE-2 and NZ=KE-2. The edges of the I,J,K coordinate system are 
used for boundary conditions (we will come back to this issue later). Finally, the field P 
is the pressure. 

Note that the common blocks in INIT-START contain only scalar, global quantities. 
The large arrays W and P are passed as subroutine arguments. This is the preffered style 
of programming in Fortran. In general, it is NOT a good idea to use common blocks at 
all as this is inconsistant with modular programming. On the Connection Machine and 
for CM-Fortran, this is even more true. The reason is that arrays defined in common are 
placed on the HEAP in memory and their size cannot be changed at run time. On the 
other hand, arrays passed as subroutine arguments are placed on the STACK and their 
size can be determined at run time. To implement a multigrid scheme, it is crucial to 
utilize this flexibility of array size. 

INIT-START initializes the flow field and pressure to their asymptotic values (far from 
the wing). To port it to CM-Fortran while leaving the other routines unchanged, one must 
do the assigning of values for W and P on the Connection Machine rather than on the 
front end. The program that replaces INIT-START and achieves this is shown in Table 3. 
It is called INIT-MID since it is only an intermediate step. 

The way INIT-MID works is as follows: 

The entry into the routine INIT is as before, using the front end arrays W and P. 
When the program is executed however, the field values are written into the arrays WCM 
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and PCM instead of W and P using parallel CM-Fortran instructions. WCM and PCM are 
explicitly declared to live on the Connection Machine using CMF$LAYOUT commands. 
Once WCM and PCM contain the correct values, their contents are transferred to the front 
end arrays W and P respectively using the ‘CMF_FE_ARRAY_FROM_CM’ commands. 
Several things should be noted. 

a. The serial index for WCM (the index labelling the field attributes) is the first index 
while for W it was the last index. The way Fortran lists array values in memory is 
to run over the left most index first followed by the next index on the right and so 
on - the so called column major layout. This means that it is sensible to have the 
‘attribute’ index as the last index because this keeps data corresponding to a single 
attribute sequential in memory. Unfortunately, to get good performance out of the 
current CM-Fortran compiler, one needs to have the serial index first. 

b. Had this routine used the values of W and P that came in to it, one would have had to 
move the corresponding values from the front end to WCM and PCM. This would be 
achieved by the ‘CMF_FE_ARRAY_TO_CM’ commands which have been commented 
out. In INIT, this transfer is unnecessary because W and P are the outputs of INIT and 
so their input values are irrelevant. In porting other routines, one would first transfer 
the input arrays to the Connection Machine using ‘CMF_FE_ARRAY_TO_CM’ com- 
mands and then transfer all output arrays back using ‘CMF_FE-ARRAY_FROM_CM’ 
commands. 

The makefile that allows one to compile and run the program with only INIT ported 
to the Connection Machine is shown in Table 4. Note that the common blocks are defined 
explicitly on the front end. Only CMinit.fem, the file containing INIT-MID, is compiled 
with the cmf compiler, everything else is compiled under Fortran-77 as before. 

The strategy for porting all the leaf routines is summarized in Figure 2. The step which 
checks that the output from the run on the Connection Machine matches the output from 
the run on the front end is extremely crucial. Most errors (compiler and programming) are 
caught at this stage. It is clear that the strategy I recommend is a very conservative one. 
However, I have found that in the long run, it is the best one. This is because it is almost 
guaranteed to succeed and it minimizes the time spent chasing programming and compiler 
bugs during the porting process. Since the changes in each phase are incremental, it is 
easy to isolate any bugs that creep in. | 

Once the leaf routines are ported, then one has to move to porting a parent routine. 
For example, once the set of leaf routines METRIC, STEP, BOUND, XPAND, INIT and 
COLLC are ported, the next step is to port FLO (see Figure 1). This is done by moving all 
the CMF_FE_ARRAY-_-TO_CM and CMF_FE_ARRAY_FROM_CM calls from the ported 
leaf routines to the parent routine. This means that the transfers of the contents of the 
arrays to and from the Connection Machine are now made in the parent routine rather 
than in the leaf routine. The leaf routine is passed the Connection Machine based arrays 
directly. The routine INIT looks much more compact after this step. It is shown in Table 5 
(I call it INIT-FINAL for obvious reasons). Note that I have removed the front end arrays 
of INIT-MID and renamed WCM and PCM to W and P. It is also important to continue 
to check that the code generates the same output file after each parent routine is ported. 


Gradually, moving up the code tree, I reached a stage when all the transfers to and 
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from the Connection Machine to the front end were done from MAIN. At this point, I 
modified the MAIN to make all the arrays in it live on the Connection Machine. The only 
subtlety now was the the various printing and monitoring routines called from MAIN and 
FLO had to be passed front end arrays because they were still in Fortran-77 format. This 
meant that there still had to be some data moved from the Connection Machine to the 
front end. Fortunately, the time spent doing this was immeasurably small. 


At this point, the code is ported because all the compute bound parts of the code run 
on the Connection Machine. 


All about Grids and Multigrid 


The grid for FLO67 is computed in gmesh (see Figure 1) and is illustrated in Figure 3. 
It is a so-called C Grid. The mapping from I,J,K to X,Y,Z is done using a mapping function 
which is an array called GRID(0:IE,0:JE,0:KE,3). The X coordinate of the point (I,J,K) is 
GRID(I,J,K,1), the Y coordinate is GRID(I,J,K,2) and the Z coordinate is GRID(I,J,K,3). 
Thus the array GRID plays the role of a ‘metric’ for the lattice coordinates. It is extremely 
convenient to have such a mapping as it allows one to map a non uniform grid in (X,Y,Z) 
space to a uniform lattice labelling scheme (1,J,K). Indeed, the grid for FLO67 had many 
more points near the wing than far away from the wing. This is because the flow fields 
W and P have a great deal of short distance structure near the wing while their values 
far from the wing are slowly varying. Thus a non-uniform grid is absolutely crucial to 
resolving all the relevant structures in the flow field. 


In Figure 3 we see the (X,Y) plane of a grid with NX=12, NY=4 for a fixed value 
of Z. The Z planes are parallel to and equidistant from the X,Y plane shown. The lattice 
coordinate axis I starts at the extreme right center, wraps around the wing and returns 
to its starting point. This means that the metric for this grid has a coordinate singularity 
along the line where the I axis falls on top of itself and the mapping (X,Y,Z)—(1,J,K) is 
double valued. This makes the C Grid non-optimal for the Connection Machine (or for 
that matter, for any parallel machine). The problem is that when we define a parallel array 
on the (I,J,K) grid with a Fortran dimension statement and a CMF$LAYOUT command, 
points along the coordinate singularity (or close to it) that are identical (or near neighbors) 
live in the computer on processors that are far apart. If we attempt to use the field values at 
the sites above and below the coordinate singularity, we will have to doa SEND operation 
(which uses the router and is much slower than the nearest neighbor NEWS operation) 
even though these points are near neighbors in physical space. We will see that the speed 
of the final ported code is seriously affected by these unnecessary SENDs. 


Another problem with FLO67 was the way the boundary conditions were done. Recall 
the arrays W and P were padded from 1:NX to 0:NX+2 and so on. First, this is wasteful 
of space. Table 6 illustrates this. The amount of wasted volume is large for the smaller 
lattice sizes and decreases as the grid size increases. This however is not the only problem. 
The real killer is when we do multigrid. For multigrid, we need NX, NY, NZ to be powers 
of two. However, this guarantees that NX+3, NY+3, NZ+3, which are the sizes of the 
padded dimensions for W and P, are not powers of two. In multigrid problems, it is natural 
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to use power of two aray sizes. This also gives some automatic performance advantages 
from the compiler. However, this possibility was completely ruled out because of the way 
the array dimensions are padded in FLO67. Whereas in FLO67, it is possible to further 
pad the arrays to lay them out optimally on the machine in Fortran, I avoided doing this 
because it is awkward. 


In the Fortran-77 code, the way the multigrid arrays were defined is shown in Figure 4 
for three multigrid levels with grids of size 8 x 8 x 4, 16 x 16 x 8 and 32 x 32 x 16. 
Consider the field W which has 5 quantities stored in it. Because of the way the arrays 
were padded, the actual space required (in words) for W on each of these grid levels is 
4235 = 11x11x7x5, 19855 = 19x19x11x5 and 116375 = 35 x 35 x 19 x 5 respectively. 
The way this was achieved in the Fortran-77 version of FLO67 was via a one dimensional 
array called W of size (4235+19855+116375). The array was passed to FLO with a pointer 
to the appropriate place in the array depending on the multigrid level being considered. 
In FLO, the array was declared a four dimensional array of the size appropriate to that 
multigrid level. Whereas this construct (of passing a pointer to a one dimensional array 
and declaring it a multidimensional array in the subroutine) is allowed in Fortran-77, it 
is currently forbidden in CM-Fortran. Even if it were possible to use this construct, it is 
a bad idea - not just for the Connection Machine but for any parallel machine. This is 
because on parallel machines, it matters where a piece of data lives. There is always a 
communication cost for moving data on these machines. This cost will be large each time a 
1-d array is moved to an array with a different geometry because this inevitably results in 
communication over large distances on the machine which is wasteful and slow. A simple 
way to avoid using this construct was adopted for FLO67. I defined different W and P 
arrays in main for each multigrid level. The arrays passed to FLO were always the ones 
relevant to the multigrid level being considered. 


Let me briefly discuss three possible ways of implementing a multigrid algorithm on 
a parallel machine. 


1. The simplest choice (and the one I picked) is to define the arrays separately for 
each multigrid level. Thus if we have two multigrid levels with grid sizes(8,8,4) and 
(16,16,8), one defines P1(0:10,0:10,0:6) and P2(0:18,0:18,0:10) for the pressure field 
on these two levels. When moving data from Pl to P2, we will have to do global 
communications. For FLO67, the time spent moving data between grid levels was 
irrelevant compared to the time spent computing on a given grid level. This is the 
reason why this simple choice of array layout was sufficient. If the cost of moving data 
between grid levels had been more important, I would have had to use the scheme 


defined below. 


2. Another choice is to define arrays with an extra scalar index that keeps track of the 
multigrid level. There is a natural association of a neighborhood of the finer grid with 
a site of the coarse grid above it. The sites corresponding to this neighborhood can 
be labelled by the extra serial index. To illustrate in the case considered in item 1 
above, the P array would now be P(1:9,0:10,0:10,0:6) with P(1,:,:,:) corresponding 
to the coarse grid and P(2:9,:,:,:) corresponding to the fine grid. The 8 points (in d 
dimensions we have 2% such points) of the fine lattice corresponding to one point on the 
coarse lattice are mapped into the serial index using some well defined (but arbitrary) 
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ordering. This choice has the advantage that the communication costs associated with 
moving data between multigrid levels is eliminated. However, the method works only 
if the coarse grid fills the machine; ie. there are at least as many points on the coarse 
grid as there are processors on the machine. There is also a potential danger that 
the program performance might degrade because the compiler might not loop over 
the serial indices in an intelligent way. This would mean that the code has to to be 
written in a cumbersome way by unrolling the multigrid serial index. 

3. A third possibility is to use the same sized grid for all levels. Instead of the coarse 
grid being of size 1/2 compared to the fine grid, it is of the same size. One thinks 
not of one coarse grid but of 2% coarse grids. The data on the coarse grids comes 
from reducing from the 2% distinct ways of defining a coarse grid from the fine grid. 
These choices come about due to the fact that one has 2% possible origins to choose 
from for the fine—coarse reduction. In this method, all the coarse grids are processed 
and their results averaged to get the final result for the coarse grid. An advantage of 
this method is the increased accuracy from the improved statistics. A disadvantage is 
that a factor N gain in accuracy requires N* amount of work. However, for parallel 
machines, where some processors would have otherwise been idle when working on the 
coarse grid, this method may get one a big improvement with only a small time cost. 


Some Examples 


I would now like to discuss some code fragments from FLO67 and how they were 
ported. These examples are meant to illustrate the ease (and dangers) of the porting 
process and the elegance (and frustrations) of the final result. 

Consider first the code that is shown in Table 7. It is a simple example of moving 
data from one multigrid level to the next. In this case we are moving from a coarse lattice 
to a fine lattice. The only important point is to compute the range of the indices correctly 
(in this case these are II,JJ,KK) for the parallel code. The elegance (and brevity) of the 
parallel code is obvious. 

Next consider the serial code in Table 8a. A little reflection indicates that what is 
being done here is an 8 point stencil. One is adding field values on the vertices of a cube 
from the array RAD and using them to compute the array DTL. Of course one could use 
the 3-d stencil compiler to do this. But let us try to write it in CM-Fortran. Table 8b 
shows an attempt to translate this code into CM-Fortran that at first sight seems correct. 
An array RADA of the same shape and size as RAD is defined. It is initiaized to RAD 
plus RAD shifted in the I direction by unity. This results in the sum of field values at two 
nearest neighbor sites. Next, RADA is shifted along the J direction and added to itself to 
get the sum of field values at four sites and finally this is repeated for the K direction to 
get the sum over the cube vertices. Everything is done using array sections that conform 
to the way the ‘DO LOOPS’ in the serial code were defined. It all looks correct!! 

However, this code gives the wrong answer. If one runs the code with this fragment 
replacing the serial code of Table 8a, one finds that the contents of some of the elements of 
RADA are wrong. What’s going on? What is happening is that RADA is initialized by the 
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first step only over the domain (1:IL,1:JL,1:KL). Since it was an array opened on the fly 
in the subroutine, it started out containing garbage values before this step. In particular, 
the contents of RADA(1:IL,JL+1,1:KL) are unpredictable as they are never defined by the 
code. When these values are used in line 2 of the code in Table 8b, the result is nonsense. 
The correct code is given in Table 8c. Note that I have used EOSHIFT. Using CSHIFT 
would have worked but EOSHIFT is faster for array sizes that are not powers of two. 

Table 9a shows a rather larger and convoluted section of serial code and Table 9b 
shows its parallel equivalent. It is clear that what the serial code is doing is compute the 
location and value of the maximum of some functions of the fields W and P. The parallel 
code shows how to do this using the MAXLOC function of CM-Fortran. This function 
computes the location of the maximum value in the array and puts it in ILOC. Note 
however that MAXLOC returns the location of the maximum relative to the first index in 
the array - so it implicitly assumes that the indices start at unity. If the array dimensions 
start from 0 instead of 1 (as is the case in FLO67), one must subtract unity from ILOC to 
get the location of the maximum. 


Timings 


In Table 10 I show the percentage of time spent in various routines called from EULER 
(see Figure 1) for the Fortran-77 code compiled and run on a SUN4 and the CM-Fortran 
code run on a CM-2 using the PARIS and SLICEWISE compiler. Note the dramatic 
increase in time spent in BCWING and HALO (20-25 percent in each compared to 0.5-2 
percent). These routines are quite innocuous in the Fortran-77 version. They work on 
the boundaries, averaging the fields below and above the cut in the C Grid along the x 
axis. The reason they slow down on the CM-2 is that the field values above and below 
the cut live in very distant processors and averaging over them involves using the router. 
Currently, the code spends about 40 percent of its time in the ROUTER communications 
involved in these boundary conditions. This could be improved using the router compiler 
which does simulated annealing to find the optimum array layout and then uses muliwire 
news to communicate. 

In Table 11, I list the code that I wrote for HALO. I show both the serial code and its 
translation into CM-Fortran. The code for BCWING is similar. It is a challenge to try to 
speed up these routines. We are attempting to do this currently for the CM-200 and the 
CM-5. 


Final Words 


Currently, FLO67-CM can do a 480 x 112 x 112 system with 5 levels of multigrid on 
a 64K CM-2. This is a system with about 7 million grid points and 41 million variables. 
The code runs 15 times faster compared to a 2 head Convex. 

It should be possible to improve its speed by using three dimensional stencils and 
compiled routing. The code runs without change on the CM-5 and the CM-200. 
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TABLE 2 


SUBROUTINE INIT-START (W,P) 
COMMON/LIM/ NX,NY,NZ,IL,JL,KL,IE,JE,KE 


COMMON/DAT/ GAMMA,RM,AL,RHOO,P0,EIO, 
H0,C0,U0,V0,W0,CA,SA,CDO 

DIMENSION W(0:1E,0:JE,0:KE,5) 

DIMENSION P(0:IE,0:JE,0:KE) 

DO 10 K=0,KE 

DO 10 J=0,JE 

DO 10 I=0,JE 

W(I,J,K,1) = RHOO 

W(L,J,K,2) = RHOO*U0 

W(I,J,K,3) = RHOO*VO 

W(I,J,K,4) = RHOO*WO 

W(L,J,K,5) = RHOO*HO -PO 

P(I,J,K) = PO 

10 CONTINUE 

RETURN 


END 


TABLE 3 


SUBROUTINE INIT-MID (W,P) 
COMMON/LIM/ NX,NY,NZ,IL,JL,KL,IE,JE,KE 
COMMON/DAT/ GAMMA,RM,AL,RHOO,PO,EIO, 


H0,C0,U0,V0,W0,CA,SA,CDO 

DIMENSION W(0:IE,0:JE,0:KE,5) 

DIMENSION P(0:IE,0:JE,0:KE) 

REAL, ARRAY(5,0:IE,0:JE,0:KE) :: WCM 

REAL, ARRAY(0:1E,0:JE,0:KE) :: POM 

INTEGER, ARRAY :: LENGTH(3) 
CMF $ LAYOUT WCM(:SERIAL,:NEWS,:NEWS,:NEWS) 
CMF $ LAYOUT PCM(:NEWS,:NEWS,:NEWS) 

LENGTH(1)=IE+1 

LENGTH(2)=JE+1 

LENGTH(3)=KE+1 
c CALL CMF_FE_ARRAY-TO-.CM(WCM(1,:,:,:),W(0,0,0,1)) 
¢ CALL CMF_FE_-ARRAY_TO-CM(WCM(2,:,:,:),W(0,0,0,2)) 
c CALL CMF_FE_ARRAY-TO-CM(WCM(3,:,:,:),W(0,0,0,3)) 
c CALL CMF_FE_ARRAY-TO.CM(WCM(4,:,:,:),W(0,0,0,4) 
c CALL CMF_FE_ARRAY-TO.CM(WCM(5,:,:,:),W(0,0,0,5) 
c CALL CMF_FE_ARRAY-TO_CM(PCM(:,:,:),p(0,0,0)) 

WCM(1,:,:,:)=RHOO 


) 
) 


WCM(2,:,:,:)=RHOO0*U0 
WCM(3,:,:,:)=RHOO*VO 
WCM(4,:,:,:)=RHOO*WO 
WCM(5,:,:,:)=RHO0*HO-PO 


TABLE 3 (cont.) 


PCM=P0 
CALL CMF_FE_ARRAY_FROM_CM(W(0,0,0,1),WCM(1,:,:,: 


:)) 
CALL CMF_FE_-ARRAY_FROM_CM(W(0,0,0,2),WCM(2,:,:,:)) 
CALL CMF _FE_ARRAY_FROM-_CM(W(0,0,0,3),WCM(3,:,:,:)) 
CALL CMF_FE_ARRAY_FROM_CM(W(0,0,0,4),WCM(4,:,:,:)) 
CALL CMF_FE-ARRAY_FROM_CM(p(0,0,0),PCM(:,:,:)) 


RETURN 
END 


TABLE 4 


GET_ARCH = A=“‘arch‘ 
OUFFIXES: .f .o 

FC = f77 -c -v 

LK = cmf -fecommon 

CMF = cmf -c -fecommon -paris 


FFILES = flo67.0 addw.o bcfar.o bewing.o bcbody.o bound.o 
collc.o dflux.o dfluxc.o timit.o eflux.o 
euler.o flo.o halo.o CMinit.o metric.o normal.o 
psmoo.o setps.o step.o gmesh.o input.o intpl.o mesh.o 
monitr.o output.o prntff.o prntm.o prntx.o secprp.o singl.o 


splif.o surf.o totfor.o wing.o coord.o dimco.o dimfin.o forcf.o 
flo67: $(FFILES) 

$(GET_ARCH); echo ” Making $$A Program ”; 

$(LK) $(FFILES) -o flo67x 
Be 

$(FC) $< 
CMinit.o: CMinit.fcm 

$(CMF) CMinit.fcm 


TABLE 5 


SUBROUTINE INIT-FINAL (W,P) 
COMMON/LIM/ NX,NY,NZ,IL,JL,KL,IE,JE,KE 
COMMON/DAT/ GAMMA,RM,AL,RHOO,P0,EI0,HO, 

C0,U0,V0,W0,CA,SA,CDO 
REAL, ARRAY(5,0:IE,0:JE,0:KE) :: W 
REAL, ARRAY(0:1E,0:JE,0:KE) :: P 
CMF$LAYOUT W(:SERIAL,:NEWS,:NEWS,:NEWS) 


CMF$LAYOUT P(:NEWS,:NEWS,:NEWS) 


WV (1,3:;:) =RHOO 
W(2;,:,:,:)=RHOO0*U0 
W(3;,:,:,:)=RHOO*VO 
W(4;,:,:,:)=RHOO*WO0 
W(5,:,:,:)=RHOO*HO0-PO 
P=P0 

RETURN 


END 


TABLE 6 


USEFUL NX, NY, NZ IE, JE, KE % 
SITES WASTED 


256 8X8X4 = F4¥1 1X7 69.8 
2048 16X16X8 19X19X11 48.4 


16384 32X32X16 35X35X19 29.6 


TABLE 7 


SERIAL CODE 


KK = 0 

DO 10 K=1,KL,2 

KK = KK +1 

JJ =0 

DO 10 J=1,JL,2 

JJ = JJ +1 

Il=0 

DO 10 I=1,IL.2 

Il = 11 41 

DW(I,J,K,N) = WW(II,JJ,KK,N) 
10 CONTINUE 


PARALLEL CODE 


II = NX/241 
JJ = NY/241 
KK = NZ/241 


DW(1:1L:2,1:JL:2,1:KL:2) = WW(LIL1:JJ,1:KK) 


TABLE 8a 


SERIAL CODE 


DO 70 K=1,KL 
N=K 41 
DO 70 J=1,JL 
M=J+41 

DO 70 I=1,IL 
L=I41 


RADA = RAD(LJ,K) +RAD(I,M,K) +RAD(LJ,N) +RAD(LM,N) 
/ +RAD(L,J,K) +RAD(L,M,K) +RAD(L,J,N) +RAD(L,M,N) 
DTL(I,J,K) = CFL/RADA 

70 CONTINUE 


TABLE 8b 


PARALLEL CODE 


(WRONG RESULTS) 


Assume that RADA is a temp array of the same shape and size as rad 


RADA(1:IL,1:JL,1:KL)=RAD(1:IL,1:JL,1:KL)+RAD(1:1L,1:JL,2:KL+1) 
RADA(1:1L,1:JL,1:KL)=RADA(1:IL,1:JL,1:KL)+RADA (1:IL,2:JL+1,1:KL) 
RADA(1:IL,1:JL,1:KL)=RADA(1:IL,1:JL,1:KL)+RADA(2:IL+1,1:JL,1:KL) 
DTL(1:IL,1:JL,1:KL)=CFL/RADA(1:IL,1:JL,1:KL) 


TABLE 8c 


PARALLEL CODE 
(CORRECT RESULTS) 


Assume that RADA is a temp array of the same shape and size as rad 


RADA= RAD+EOSHIFT(RAD,1,1) 

RADA= RADA+EOSHIFT(RADA,2,1) 

RADA= RADA+EOSHIFT(RADA,3,1) 
DTL(1:IL,1:JL,1:KL)=CFL/RADA (1:IL,1:JL,1:KL) 


85 


90 


TABLE 9a 


SERIAL CODE 


DO 90 K=1,KL 
N=K 41 
DO 90 J=1,JL 
M=J41 

DO 90 I=1,IL 
L=I141 


H = (W(LJ,K,5) +P(LJ,K))/W(I,J,K,1) -HO 
IF (ABS(H).LE.ABS(HMAX)) GO TO 85 

HMAX = H 

IH =I 

JH =J 

KH = K 

VOLA = VOL(I,J,K) +VOL(I,M,K) +VOL(I,J,N) +VOL(I,M,N) 
- +VOL(L,J,K) +VOL(L,M,K) +VOL(L,J,N) +VOL(L,M,N) 
RT = (W(1,J,K,1) -WN(I,J,K,1))/(DTL(I,J,K)*VOLA) 

IF (ABS(RT).LE.ABS(RTMAX)) GO TO 90 

RTMAX = RT 

IRT =I 

JRT = J 

KRT = K 

CONTINUE 


TABLE 9b 


PARALLEL CODE 


Assume that H and VOLA are real arrays and MASK is a logical array 
Both are of the same shape and size as P. 


ILOC is an integer array of size 3. 


MASK = .FALSE. 
MASK(1:IL,1:JL,1:KL)=.TRUE. 

H = W(5,:,:,:) + P/W(1,:,:,:) - HO: 

H = ABS(H) 

ILOC = MAXLOC(H,MASK=MASK) 
IH=ILOC(1)-1 

JH=ILOC(2)-1 

KH=ILOC(3)-1 

HMAX=H(IH,JH,KH) 

VOLA = VOL +EOSHIFT(VOL,1,1) 
VOLA = VOLA +EOSHIFT(VOLA,2,1) 
VOLA = VOLA +EOSHIFT(VOLA,3,1) 
WHERE(MASK) 

H = (W(1,:,:,:) - WN(1,:,:,:))/(DTL*VOLA) 
ENDWHERE 

ILOC = MAXLOC(H,MASK=MASK) 
IRT=ILOC(1)-1 

JRT=ILOC(2)-1 

KRT=ILOC(3)-1 

RIMAX = H(IRT,JRT,KRT) 


TABLE 10 


% TIME FROM EULER 


ROUTINE 


BCWING 
EFLUX 
HALO 
PSMOO 
DFLUX 
BCBODY 
DFLUXC 


BCFAR 


SLICE 


20.1 


24.6 


25.4 


13.1 


9.4 


af 


2.3 


1.4 


Wf 


0.5 


42.8 


1.9 


18.5 


31.0 


0.3 


4.4 


0.1 


TABLE 11 


This is an example of porting a subroutine. The commented-out, up- 
percase commands are from the original, F77 listing. The lowercase com- 
mands immediately below are the CM-Fortran commands that replace the 
commented lines. 


SUBROUTINE HALO (W,P) 
COMMON/DAT/GAMMA,RM,AL,RHOO,P0,EI0,H0,CO,U0,V0,W0,CA,SA,O 
COMMON/LIM/ NX,NY,NZ,IL,JL,KL,IE,JE,KE 
COMMON/IKW/ KTIP,ITL ITU 

C DIMENSION W(0:IE,0:JE,0:KE,5) 

C DIMENSION P(0:IE,0:JE,0:KE) 
real, array(5,0:ie,0:je,0:ke) :: w 
real, array(0:ie,0:je,0:ke) :: p 
real, array(0:ie,0:je,0:ke) :: qq,rr,rl,r2,r3 
logical, array(0:ie,0:je,0:ke):: mask1,mask2 

CMF$ LAYOUT w(:serial,:news,:news,:news) 

CMF$ LAYOUT p(:news,:news,:news) 

CMF$ LAYOUT qq(:news,:news,:news) 

CMF$ LAYOUT rr(:news,:news,:news) 

CMF$ LAYOUT r1(:news,:news,:news) 

CMF$ LAYOUT r2(:news,:news,:news) 

CMF$ LAYOUT r3(:news,:news,:news) 

CMF$ LAYOUT mask1(:news,:news,:news) 

CMF$ LAYOUT mask2(:news,:news,:news) 


0 OR Se Ae aS eR ae he ae: Ps a a 


TABLE 11 (cont.) 


DO 10 K=1,KL 

DO 10 J=1,JL 

W(0,J,K,1) = W(1,J,K,1) +W(1,J,K,1) -W(2,J,K,1) 
W(0,J,K,2) = W(1,J,K,2) +W(1,J,K,2) -W(2,J,K,2) 
W(0,J,K,3) = W(1,J,K,3) +W(1,J,K,3) -W(2,J,K,3) 
W(0,J,K,4) = W(1,J,K,4) +W(1,J,K,4) -W(2,J,K,4) 
W(0,J,K,5) = W(1,J,K,5) +W(1,J,K,5) -W(2,J,K,5) 
P(0,J,K) = P(1,J,K) +P(1,J,K) -P(2,J,K) 


10 CONTINUE 
ccccccecccceccccecececeececcececceceececececeeeeceeecccce 
w(1,0,1:j1,1:k1)=2.*w(1,1,1:jl,1:kl)-w(1,2,1:j1,1:kl) 
w(2,0,1:j1,1:k])=2.*w(2,1,1:jl,1:kl)-w(2,2,1:j1,1:k1) 
w(3,0,1:j1,1:kl)=2.*w(3,1,1:j1,1:kl)-w(3,2,1:jl,1:k1) 
w(4,0,1:j1,1:kl)=2.*w(4,1,1:jl,1:kl)-w(4,2,1:j1,1:kl) 
w(5,0,1:j1,1:kl)=2.*w(5,1,1:j1,1:kl)-w(5,2,1:jl,1:k]) 
p(0,1:j1,1:kl)=2.*p(1,1:j1,1:kl)-p(2,1:j1,1:k1) 


TABLE 11 (cont.) 


C DO 20 K=1,KL 

C DO 20 J=1,JL 

C W(IE,J,K,1) = W(IL,J,K,1) +W(IL,J,K,1) -W(NX,J,K,1) 
C W(IE,J,K,2) = W(IL,J,K,2) +W(IL,J,K,2) -W(NX,J,K,2) 
C W(IE,J,K,3) = W(IL,J,K,3) +W(IL,J,K,3) -W(NX,J,K,3) 
C W(IE,J,K,4) = W(IL,J,K,4) +W(IL,J,K,4) -W(NX,J,K,4) 
C W(IE,J,K,5) = W(IL,J,K,5) +W(IL,J,K,5) -W(NX,J,K,5) 
C P(IE,J,K) = P(IL,J,K) +P(IL,J,K) -P(NX,J,K) 


C20 CONTINUE 

C  cecceccceceeceeeeceeceeceeeeecceeeeceeeeeeeecececcececcecce 
w(1,ie,1:jl,1:k1)=2.*w(1 ie-1,1:j1,1:kl)-w(1,ie-2,1:jl,1:kK1) 
w(2,ie,1:jl,1:k1)=2.*w(2,ie-1,1:jl,1:kl)-w(2,ie-2,1:jl,1:kl) 
w(3,ie,1:j],1:k1)=2.*w(3,ie-1,1:j1,1:kl)-w(3,ie-2,1:jl,1:kl) 
w(4,ie,1:jl,1:kl)=2.*w(4,ie-1,1:j1,1:kl)-w(4,ie-2,1:jl,1:k1) 
w(5,ie,1:jl, eit: 2.*w(5,ie-1,1:jl,1:kl)-w(5,ie-2,1:jl,1:kl) 

p(ie,1:jl,1:kl)=2.*p(ie-1,1:jl,1:kl)-p(ie-2,1:jl,1:kl) 


TABLE 11 (cont.) 


C DO 25 K=1,KL 
C DO 25 I=1,IL 

GC ifP= fF 

C W(I,0,K,1) = W(II,2,K,1) 
C W(I,0,K,2) = W(II,2,K,2) 
C W(I,0,K,3) = W(IL,2,K,3) 
C W(I,0,K,4) = W(IL2,K,4) 
C W(I,0,K,5) = W(II,2,K,5) 
C P(I,0,K) = P(II,2,K) 


C25 CONTINUE 

C ccceccececccececcecececcceccecccccecececccecececceccccceccce 
w(1,1:il,0,1:kl)=w/(1,ie-1:1:-1,2,1:kl) 
w(2,1:i1,0,1:kl)=w/(2,ie-1:1:-1,2,1:kl) 
w(3,1:i1,0,1:k1)=w(3,ie-1:1:-1,2,1:kl) 
w(4,1:i1,0,1:kl)=w/(4,ie-1:1:-1,2,1:kl) 
w(5,1:i1,0,1:k1)=w(5,ie-1:1:-1,2,1:k1) 
p(1:i1,0,1:kl)=p(ie-1:1:-1,2,1:k1) 


es Sh ae a NO: re SO Nie I lk ee ee a oa ae. 


TABLE 11 (cort.) 


Il = ITL +1 
f= ITU -1 

DO 30 K=1,KTIP 

DO 30 I=I1,I2 

W(I,0,K,1) = W(1L,1,K,1) +W(I,1,K,1) -W(1,2,K,1) 
W(I,0,K,2) = W(L,1,K,2) +W(L1,K,2) -W(I,2,K,2) 
W(I,0,K,3) = W(I,1,K,3) +W(I,1,K,3) -W(I,2,K,3) 
W(I,0,K,4) = W(I,1,K,4) +W(I,1,K,4) -W(I,2,K,4) 
W(I,0,K,5) = W(1,1,K,5) +W(I,1,K,5) -W(I,2,K,5) 
P(I,0,K) = P(I,1,K) +P(I,1,K) -P(I,2,K) 


30 CONTINUE 
ccccccececececececececececcecececeeecececcecceeccceecccccccce 
w(1,i1:12,0,1:ktip)=2.*w(1,i1:i2,1,1:ktip)-w(1,i1:i2,2,1:ktip) 
w(2,i1:12,0,1:ktip)=2.*w(2,i1:i2,1,1:ktip)-w(2,i1:i12,2,1:ktip) 
w(3,i1:12,0,1:ktip)=2.*w(3,i1:i2,1,1:ktip)-w(3,i1:12,2,1:ktip) 
w(4,i1:12,0,1:ktip)=2.*w(4,i1:i2,1,1:ktip)-w(4,i1:12,2,1:ktip) 
w(5,i1:12,0,1:ktip)=2.*w(5,i1:i2,1,1:ktip)-w(5,i1:12,2,1:ktip) 
p(il:i2,0,1:ktip)=2.*p(il:i2,1,1:ktip )-p(i1:i2,2,1:ktip) 


Caer Ca Co >) CO Ceo. C2 


TABLE 11 (cont.) 


DO 40 K=1,KL 
DO 40 I=1,IL 
W(LJE,K,1) = W(LJL,K,1) +W(LJL,K,1) -W(I,NY,K,1) 
W(LJE,K,2) = W(LJL,K,2) +W(1,JL,K,2) -W(L,NY,K,2) 
W(LJE,K,3) = W(LJL,K,3) +W(I,JL,K,3) -W(I,NY,K,3) 
W(LJE,K,4) = W(L,JL,K,4) +W(L,JL,K,4) -W(LNY,K,4) 
W(LJE,K,5) = W(1,JL,K,5) +W(LJL,K,5) -W(LNY,K,5) 
P(I,JE,K) = P(1,JL,K) +P(I,JL,K) -P(I,NY,K) 

40 CONTINUE 
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC 


w(1,1:il je,1:k)= 
w(2,1:il,je,1:k1) 
w(3,1:il,je,1:kl)=2 
w(4,1:il,je,1:kl)=2.*w(4,1:il,je-1,1:kl) 
w(5,1:il,je,1:kl)=2.*w(5,1:il je-1,1:kl) 
p(1:il,je,1:kl)=2.*p(1:il,je-1,1:kl)-p(1 sil ,je-2,1:kl) 


=2.*w(2,1:ilje-1,1:kl) 


2.*w(1,1:il,je-1,1:kl) 


*w(3,1:il je-1,1:k]) 


-w(1,1:il,je-2,1:k1) 
-w(2,1:il,je-2,1:k1) 
-w(3,1:il,je-2,1:k1) 
-w(4,1:il,je-2,1:k1) 
-w(5,1:il,je-2,1:k1) 


TABLE 11 (cont.) 


C DO 50 J=1,JL 

C DO 50 I=1,IL 

C W(LJ,0,1) = W(1,J,1,1) +W(LJ,1,1) -W(,J,2,1) 
C W(I,J,0,2) = W(1,J,1,2) +W(I,J,1,2) -W(I,J,2,2) 
C W(LJ,0,3) = W(I,J,1,3) +W(I,J,1,3) -W(I,J,2,3) 
C W(IJ,0,4) = W(1,J,1,4) +W(I,J,1,4) -W(I,J,2,4) 
C W(IJ,0,5) = W(1,J,1,5) +W(I,J,1,5) -W(I,J,2,5) 
G 


P(I,J,0) = P(I,J,1) +P(I,J,1) -P(1,J,2) 

C50 CONTINUE 

C  cccccccecceeccececececccececccccccececceececececcececceccce 
W(1,1:i1,1:j1,0) = 2.*W(1,1:i1,1:j1,1) - W(1,1:i1,1:31,2) 
W(2,1:i1,1:j1,0) = 2.*W(2,1:i1,1:j1,1) - W(2,1:il,1:j1,2) 
W(3,1:i1,1:j1,0) = 2.*W(3,1:i,1:j1,1) - W(3,1:i1,1:j1,2) 
W(4,1:i1,1:j1,0) = 2.*W(4,1:il,1:jl,1) - W(4,1:i1,1:j1,2) 
W(5,1:i1,1:jl,0) = 2.*W(5,1:il,1:j1,1) - W(5,1:i1,1:j1,2) 
p(1:il,1:j1,0) = 2.*p(1:il,1:j1,1) - p(1:il,131,2) 


Co Cage. £2 Cee. Cl te eo 


TABLE 11 (cont.) 


DO 60 J=1,JL 

DO 60 I=1,IL 

W(LJ,KE,1) = W(I,J,KL,1) +W(I,J,KL,1) -W(LJ,NZ,1) 
W(I,J,KE,2) = W(I,J,KL,2) +W(LJ,KL,2) -W(1,J,NZ,2) 
W(I,J,KE,3) = W(L,J,KL,3) +W(I,J,KL,3) -W(LJ,NZ,3) 
W(1,J,KE,4) = W(I,J,KL,4) +W(I,J,KL,4) -W(I,J,NZ,4) 
W(IJ,KE,5) = W(1,J,KL,5) +W(I,J,KL,5) -W(1,J,NZ,5) 
P(I,J,KE) = P(I,J,KL) +P(I,J,KL) -P(1,J,NZ) 


60 CONTINUE 


cccccccecceccccccecceceececececeeceeceeeccececeececececccce 
w(1,1:il,1:jl,ke)=2.*w(1,1:il,1:jl,ke-1)-w(1,1:il,1:jl,ke-2) 
w(2,1:i1,1:jl,ke)=2.*w(2,1:i1,1:jl,ke-1)-w(2,1:il,1:jl,ke-2) 
w(3,1:il,1:jl,ke)=2.*w(3,1:i1,1:jl,ke-1)-w(3,1:il,1:jl,ke-2) 
w(4,1:i1,1:jl,ke)=2.*w(4,1:il,1:jl,ke-1)-w(4,1:il,1:jl,ke-2) 
w(5,1:i1,1:jl,ke)=2.*w(5,1:il,1:jl,ke-1)-w(5,1:il,1:jl,ke-2) 
p(1:il,1:jl,.ke)=2.*p(1:il,1:jl,ke-1)-p(1:il,1:jl,ke-2) 


TABLE 11 (cont.) 


DO 70 K=1,KL 
Il = NX/2 

IF (K.LE.KTIP) ll = ITL 

DO 70 I=2,11 

Il = IE-I 

W(1,1,K,1) = .5*(W(I,1,K,1) +W/(II,1,K,1)) 
W(I,1,K,2) = .5*(W(I,1,K,2) +W(II,1,K,2)) 
W(I,1,K,3) = .5*(W(I,1,K,3) +W/(II,1,K,3)) 
W(I,1,K,4) = .5*(W(L1,K,4) +W(IL,1,K,4)) 
W(I,1,K,5) = .5*(W(L1,K,5) +W(IL1,K,5)) 

QQ = W(L,1,K,2)**2 +W(L,1,K,3)**2 +W(I,1,K,4)**2 
EQQ = .5*QQ/W(L1,K,1) 

P(I,1,K) = (GAMMA -1.)*DIM(W(L1,K,5),EQQ) 
W(II,1,K,1) = W(1,1,K,1) 

W(II,1,K,2) = W(I,1,K,2) 

W(IL,1,K,3) = W(I,1,K,3) 

W(II,1,K,4) = W(L,1,K,4) 

W(II,1,K,5) = W(L1,K,5) 

P(II,1,K) = P(1,1,K) 

C70 CONTINUE 


C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC 


Oe OR! £2 Ch C3 Ce ee eee ee oe, CF OO ey Oy ee 


TABLE 11 (cont.) 


nxh=nx/2 

ktipp=ktip+1 

nxitl=ie-itl 

nxhh=ie-nxh 

mask1=.false. 
mask1(2:itl,1,1:ktip)=.true. 
mask1(2:nxh,1,ktipp:kl)=.true. 
r1(2:nxh,1,1:kl)=w(1,nx:nxhh:-1,1,1:kl) 
r2(2:nxh,1,1:kl)=w(2,nx:nxhh:-1,1,1:kl) 
13(2:nxh,1,1:kl)=w(3,nx:nxhh:-1,1,1:kl) 
qq(2:nxh,1,1:kl)=w(4,nx:nxbh:-1,1,1:kl) 
rr(2:nxh,1,1:kl)=w(5,nx:nxhh:-1,1,1:kl) 
where(mask1) 
w(1,33:)=5*(w(1, 
w(2,5:)=5*(w (2, 
w(3,:,:,:)=.5"(w(3,: 
w (4,22) =.5*(w(4y 
w(5,:,:5:)=.5"(w(5,: 


endwhere 


WwW 


mask2=.false. 
mask2(nxhh:nx,1,ktipp:kl)=.true. 
mask2(nxitl:nx,1,1:ktip)=.true. 


TABLE 11 (cont.) 


ri(nxhh:nx,1,1:kl)=w/(1,nxh:2:-1,1,1:kl) 
r2(nxhh:nx,1,1:k1)=w(2,nxh:2:-1,1,1:kl) 
13(nxhh:nx,1,1:kl)=w(3,nxh:2:-1,1,1:kl) 
qq(nxhh:nx,1,1:kl)=w(4,nxh:2:-1,1,1:kl) 
tr(nxhh:nx,1,1:k1)=w(5,nxh:2:-1,1,1:kl) 
where(mask2) 

wil 9=t1 

wiz =22 

Wi,.i0=t3 

w(4,:,:,:)=qq 

w(5,:,:,)) =f 

endwhere 

COQ = W( 2,332)? "2 eee ee 4) 
QQ(2:nx,:,1:kl) = .5*QQ(2:nx,:,1:kK1) /W(1,2:nx,:,1:k1) 
iT=w(95,:,:;:) 

qq=max(rr-qq,0.0) 

qq=(gamma-1.)*qq 

where(mask1) p=qq 

where(mask2) p=qq 

RETURN 

END 
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